Abstract



A novel text analysis and characterisation method 
involves the generation from text samples of sets of variable- 
length character strings. These sets are intermediate in number 
between the character set and the total number of words in a
data-base; their distribution is less disparate than those of 
either characters or words. The size of the sets of character 
strings (key-sets) can be varied arbitrarily by changing parameters. 

The characteristics of three scientific data-bases 
(two disciplinary, one interdisciplinary) are compared in terms 
of key-sets of different sizes. Application of the key-sets for 
file compression, using a variable to fixed-length coding 
strategy, is discussed. 



Introduction. 






Shannon tells us that the set of symbols ideal for
economy in mechanical storage and transmission of information
is one in which the symbols are equiprobable. In that case, the
value for the entropy, as given by the expression

$$-H = \sum_{i=1}^{n} p_i \log_2 p_i$$

reaches a maximum. This value is the binary logarithm of the
total number of symbols in the set, i.e., their variety. Since,
in mechanical systems, symbols are most conveniently represented
by fixed-length binary patterns, it is natural to consider
symbol sets which are equal in number to integral powers of two.
A series of such ideal sets can be represented as in Figure 1.
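
As a rough illustration of the entropy expression, the following Python sketch computes H for equiprobable sets whose sizes are powers of two, and for one skewed set. The function name `entropy` and the 1/rank weights used for the skewed set are assumptions made for this example only; they are not taken from the paper.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)); zero terms are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Equiprobable ("rectangular") sets whose sizes are integral powers of two.
for bits in (1, 2, 3, 8):
    n = 2 ** bits
    uniform = [1 / n] * n
    print(n, entropy(uniform))  # the entropy equals log2(n), i.e. `bits`

# A skewed set of 8 symbols with weights proportional to 1/rank
# (an illustrative stand-in for a hyperbolic distribution).
weights = [1 / rank for rank in range(1, 9)]
total = sum(weights)
skewed = [w / total for w in weights]
print(entropy(skewed))  # lower than the 3 bits of the equiprobable 8-symbol set
```

For a set of 256 equiprobable symbols the function returns 8 bits, which is the fixed code length such a set would need.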
Figure 1. Distributions of ideal symbol sets.

The two dominant features are that the distribution is rectangular,
and that the value of the entropy is determined by the variety
of symbols, however defined.

Symbol sets with such ideal distributions are most
uncommon in natural circumstances; much more typical is a
hyperbolic distribution [2], such as that displayed by the
characters of the titles of articles included in Chemical Titles,
as shown in Figure 2.

Figure 2. Distribution of characters in 1000 titles
from Chemical Titles.

The most common approach to converting a hyperbolic
distribution to a rectangular one is exemplified by schemes
introduced by Shannon, Fano [3] and by Huffman [4]. These involve
taking a fixed-length segment of text, and representing it by
means of a variable-length code, the length of the code being
inversely related to the frequency of the symbol. This can be
shown, if notionally, by the diagram of Figure 3. It involves
a fixed-to-variable length transformation, which is merely one
of three possible strategies, the others being variable-to-
fixed length, and variable-to-variable length. The second is
known in the context of run-length coding, a method used for
[...] much study, although Walker [6] has used it to compress
Christian names taken from early English parish registers.

Figure 3. Notional mapping of hyperbolic onto rectangular
distribution by means of fixed-to-variable
length transformation.

If we consider a hyperbolic symbol distribution, the
disparate frequencies can be reduced by considering uniform
aggregates of the symbols. Thus, if we count digrams, i.e.,
character pairs, instead of symbols, we can reduce the greatest
frequency by perhaps an order of magnitude. The cost of this
is an increase in the variety of the new symbols considered,
again by something like an order of magnitude. As we consider
longer uniform character strings, or fixed-length n-grams, we
constantly reduce the range of frequencies, but always with an
accompanying increase in variety, as illustrated for INSPEC
titles in Figure 4 for n = 1, 4 and 8.
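
The trade-off between greatest frequency and variety can be seen on a toy sample. The sketch below counts overlapping fixed-length n-grams and prints, for each n, the variety and the most frequent string. The helper `ngram_counts` and the three sample titles are hypothetical and serve only to illustrate the counting; they are not data from the paper.

```python
from collections import Counter


def ngram_counts(titles, n):
    """Count every overlapping n-character substring within each title."""
    counts = Counter()
    for title in titles:
        for i in range(len(title) - n + 1):
            counts[title[i:i + n]] += 1
    return counts


# Hypothetical sample titles, invented for illustration.
titles = [
    "SYNTHESIS OF SUBSTITUTED PYRIDINES",
    "KINETICS OF THE OXIDATION OF ETHANOL",
    "STRUCTURE OF CRYSTALLINE POLYMERS",
]

for n in (1, 2, 4):
    counts = ngram_counts(titles, n)
    top, freq = counts.most_common(1)[0]
    # variety = number of distinct n-grams; freq = count of the commonest one
    print(n, len(counts), repr(top), freq)
```

The greatest digram frequency can never exceed the greatest character frequency, since every occurrence of a digram is also an occurrence of its first character.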

Variable-length character strings.

Returning to the variable-to-fixed length compression
approach, let us consider a simple method of ironing out
unevenness in distribution without the great increase in
variety caused by taking uniform segments of text.

Figure 4. Rank frequency distribution for character
strings of lengths 1, 4 and 9.

We first produce ranked lists of characters, digrams,
trigrams, etc., for a substantial sample of text, as illustrated
in Table 1 for Chemical Titles. By adding the most frequent
digram to the original symbol set, we increase the variety by
one, but reduce the frequencies of two of the most frequent
characters quite substantially. We continue this by adding
further digrams until we reach that digram with a frequency
equal to or just below that of the most frequent trigram. We
add this trigram in turn to the set. We continue the procedure,
adding further n-grams of any length as their frequency equals
that currently being considered, until the total number of
"symbols" in the new set equals some power of two, e.g., 256.

This now constitutes what we call a key-set, composed
of variable-length strings, or keys. The majority of these are
short, digrams or trigrams. To each is assigned a numeric code,
which can be represented by 8 bits for a key-set size of 256.
Obviously, the process can be continued by continual addition
of further n-grams, so that the key-set can be enlarged to
any desired level. Table 1 illustrates the ranked n-grams for
a sample of 1000 titles from Chemical Titles.

Table 1. Frequency-ranked n-grams from 1000 titles
from Chemical Titles.
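
The simple selection procedure described above might be sketched as follows. This is a minimal reading of the text, not the authors' program: the function name `build_key_set`, the `max_n` limit on string length, and the tie-breaking rule (shorter strings first, then alphabetical) are assumptions. Frequencies are counted once on the raw text and are not adjusted as keys are added.

```python
from collections import Counter


def build_key_set(titles, size, max_n=8):
    """Return single characters plus the most frequent n-grams, up to `size` keys."""
    # Every character in the sample must be a key, however rare.
    chars = sorted({c for title in titles for c in title})
    if len(chars) > size:
        raise ValueError("key-set size is smaller than the character set")

    # Count all overlapping n-grams for n = 2 .. max_n (max_n is an assumed limit).
    counts = Counter()
    for n in range(2, max_n + 1):
        for title in titles:
            for i in range(len(title) - n + 1):
                counts[title[i:i + n]] += 1

    # Merge the ranked lists: highest frequency first; on equal frequency
    # the shorter string is taken first, so a digram precedes a trigram.
    ranked = sorted(counts, key=lambda g: (-counts[g], len(g), g))
    return chars + ranked[: size - len(chars)]


# Toy usage on two invented strings; a real key-set would use size=256.
print(build_key_set(["ABABAB", "ABC"], size=5))
```

The position of each key in the returned list can serve as its numeric code, so that a 256-key set needs 8 bits per code.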

Data-compression.

We now apply the key-set to text in the following
manner. We take the initial characters, select the longest key
available which matches them exactly, and substitute its code
for the string. Starting from the next character not included
in the first string, we repeat the process until the end of
the text is reached. Figure 5 illustrates the process.

Figure 5. Encodement of text by variable-length
character strings.

Obviously, the key-set must contain all characters in the text
to be processed, no matter how infrequently some may appear.
When a code for a single character is used there may be an
overall loss (depending on the fixed-length character code
employed). When a longer n-gram is encoded, the number of bits
required for storage is reduced, the advantage being the number
of bits saved multiplied by the number of occasions on which
the n-gram is used.
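
The encoding step and the resulting saving can be sketched as below. The names `encode` and `reduction`, and the toy key-set, are assumptions for illustration. The defaults of 8 bits per key code and 6 bits per character correspond to a 256-key set and the 6-bit character code mentioned in the text.

```python
def encode(text, keys):
    """Greedy encoding: at each position substitute the longest matching key."""
    code_of = {key: code for code, key in enumerate(keys)}
    longest = max(len(key) for key in keys)
    codes = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in code_of:
                codes.append(code_of[candidate])
                pos += length
                break
        else:
            # Reached only if a character of the text is missing from the key-set.
            raise ValueError(f"no key covers character {text[pos]!r}")
    return codes


def reduction(text, codes, key_bits=8, char_bits=6):
    """Fraction of storage bits saved relative to fixed-length character coding."""
    return 1 - (len(codes) * key_bits) / (len(text) * char_bits)


keys = ["A", "B", "C", "AB", "ABA"]  # hypothetical key-set
codes = encode("ABABAC", keys)
print(codes)                       # [4, 1, 0, 2]
print(reduction("ABABAC", codes))  # 1 - (4 * 8) / (6 * 6), about 0.11
```

In this toy case the single-character codes each cost 8 bits against 6 for the original character, and the saving comes entirely from the one three-character key.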

The method of selecting a key-set we have just outlined
is a simple one, and ignores the fact that certain of the
smaller keys are wholly contained within longer ones, and
seldom if ever assigned. By eliminating these, and adding
further n-grams from the candidate list, performance can be
appreciably increased. We have already described other methods
of generating key-sets by suitable programs [7, 8]. Simple though
the above method is, its performance is comparable with key-
sets produced automatically.

We have now determined compression ratios obtainable
with automatically produced key-sets on titles from three
different data-bases, at two key-set sizes, 256 and 512. The
composition of a typical key-set with 256 n-grams is shown in
Table 2, while Table 3 shows the compression ratios obtained
with these key-sets. The figures represent the reduction in
the numbers of bits required for storage, based on a 6-bit
character code (ICL 1900 Series computer). If the character
code were an 8-bit code, the advantage gained would be
correspondingly greater, reaching approximately 50% with the
256 key-set. This would presume use of a single-case character
set, and of a shift-code if a multiple-case alphabet were used.

It is worth noting that Snyderman and Hunt [9] and Schieber and
Thomas [10], by adding digrams to the basic character code of
IBM 360 machines, have achieved compression ratios of 35% and
43.5%, while Byrne and Mullaney [11] have attained a ratio of 44%
based on an 8-bit code, also using an n-gram method.

Table 2. Composition of key-set of 256 keys from INSPEC titles.

                COMPRESSION RATIOS
             256 KEY-SET    512 KEY-SET
   CT           33.9%          37.5%
   ASCA         31.7%          35.1%
   INSPEC       33.9%          37.6%

Table 3. Compression ratios obtained with key-sets of size 256
and 512 keys (ratios based on a 6-bit character coding).
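
The figure of approximately 50% quoted for an 8-bit character code can be checked against the 6-bit result by simple arithmetic. The sketch below assumes that the same text and the same 8-bit key codes are used in both cases, so that only the size of the original character code changes. The function names are hypothetical.

```python
def codes_per_character(reduction, key_bits, char_bits):
    """Average number of key codes emitted per source character."""
    return (1 - reduction) * char_bits / key_bits


def reduction_for(codes_per_char, key_bits, char_bits):
    """Reduction in storage bits for a given character code size."""
    return 1 - codes_per_char * key_bits / char_bits


# 33.9% reduction reported for the 256 key-set against a 6-bit character code.
rate = codes_per_character(0.339, key_bits=8, char_bits=6)
print(rate)                                           # about 0.496 codes per character
print(reduction_for(rate, key_bits=8, char_bits=8))  # about 0.504, i.e. roughly 50%
```

On this reading each 8-bit key code replaces, on average, about two characters of text.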

It is interesting to examine the extent to which the
mapping of a hyperbolic distribution of characters onto a
rectangular distribution has been achieved by this procedure.
Figure 6 shows the shape of the rank-frequency distribution
curve for the INSPEC 256 key-set applied to titles, plotted
on a log/log scale. The entropy value was calculated and found
to be [...], indicating that little further improvement of
performance can be expected with this key-set size, although
greater degrees of compression might well be obtained by using
a variable-variable length strategy, the third mentioned above.

Figure 6. Rank-frequency curve (log/log) of distribution
of n-gram keys.

Comparison of data-bases.

Having earlier determined that key-sets produced from
one data-base over a period of three years were substantially
stable, we were interested in determining what similarities
or differences existed between different data-bases. Those we
chose were INSPEC, Chemical Titles (CT) and ASCA, representing
two disciplinary data-bases and one interdisciplinary. Using
key-sets containing 256 keys, 191 keys were found to be common
to all three, while pair-wise comparisons showed even greater
similarities, i.e., 209 keys common to the sets from CT and
ASCA, 210 common to CT and INSPEC, and 213 common to ASCA and
INSPEC. Further confirmation of this similarity was obtained
by using a key-set produced from one data-base for compression
of another. As Table 4 illustrates, only slight reductions in
the compression ratios were observed, indicating that as far
as titles are concerned, the statistical microstructures of the
data-bases are very similar indeed. This is in spite of
substantial dissimilarities in the vocabularies of the data-
bases; Table 5 gives the ranks of most frequent words in a
sample of each data-base. (The word THE is automatically
removed from titles in the ASCA file.) It is noticeable that
discipline-oriented content words appear at higher ranks in
INSPEC and CT than in ASCA.

                        256 keys           512 keys
   CT with INSPEC       31.9% (33.9%)      34.3% (37.5%)
   INSPEC with ASCA     32.5% (33.9%)      35.8% (37.6%)
   ASCA with CT         31.1% (31.7%)      33.6% (35.1%)

Table 4. Compression ratios using key-sets derived
from another data-base.
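
The key-set comparison amounts to counting shared keys between sets. A minimal sketch follows. The function `shared_keys` and the three five-key sets are invented for illustration and are not the key-sets used in the paper.

```python
from itertools import combinations


def shared_keys(key_sets):
    """Return pair-wise shared-key counts and the count common to all sets."""
    names = sorted(key_sets)
    pairwise = {
        (a, b): len(set(key_sets[a]) & set(key_sets[b]))
        for a, b in combinations(names, 2)
    }
    common_to_all = set.intersection(*(set(keys) for keys in key_sets.values()))
    return pairwise, len(common_to_all)


# Hypothetical miniature key-sets, one per data-base.
key_sets = {
    "CT": ["E", "T", "ION", "ATE", "OF "],
    "ASCA": ["E", "T", "ION", "ING", "OF "],
    "INSPEC": ["E", "T", "ION", "ING", "THE"],
}

pairwise, common = shared_keys(key_sets)
print(pairwise)  # shared-key counts for each pair of data-bases
print(common)    # number of keys present in all three sets
```

As in the paper's figures, the number of keys common to all sets can never exceed the smallest pair-wise count.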

Table 5. Word rankings for samples of three data-bases.

Summary.

The work we have described represents an extension of
the strategies available for data-compression, with potentially
useful applications. It also provides an information-theoretic
model which we believe can have considerable significance in
the context of computer-based retrieval systems, on which we
hope to report shortly.